library(tidyverse)
library(tidymodels)
library(gridExtra)
library(ISLR2)
library(leaps)

Chapter 6 Part 2
Shrinkage
Setup
Shrinkage Methods
Ridge regression and Lasso - The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Another Reason
- Sometimes we can’t solve for \(\hat\beta\)
- Why?
- Sometimes we can’t solve for \(\hat\beta\)
- We have more variables than observations ( \(p > n\) )
- The variables are linear combinations of one another
- The variance can blow up
What can we do about this?
Ridge Regression
- What if we add an additional penalty to keep the \(\hat\beta\) coefficients small (this will keep the variance from blowing up!)
- Instead of minimizing \(RSS\), like we do with linear regression, let’s minimize \(RSS\) PLUS some penalty function
- \(RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}\)
- What happens when \(\lambda=0\)? What happens as \(\lambda\rightarrow\infty\)?
Ridge Regression
- Recall, the least squares fitting procedure estimates \(\beta_0,...,\beta_p\) using the values that minimize \[RSS = \sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2\]
- Ridge regression coefficient estimates, \(\hat{\beta}^R\) are the values that minimize
\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p\beta_j^2\]
\[ = RSS + \lambda\sum_{j=1}^p\beta_j^2\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately
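With the tidymodels packages loaded in the setup above, a ridge fit can be sketched as follows; `mixture = 0` selects the pure ridge penalty and `penalty` is \(\lambda\). This is an illustration only, using the `Hitters` data from ISLR2 and assuming the glmnet engine is installed.

```r
# Illustration: Hitters data from ISLR2, predicting Salary
hitters <- na.omit(Hitters)

# mixture = 0 => ridge penalty; penalty is the value of lambda
ridge_spec <- linear_reg(penalty = 10, mixture = 0) %>%
  set_engine("glmnet")

ridge_fit <- ridge_spec %>%
  fit(Salary ~ ., data = hitters)

tidy(ridge_fit)  # coefficient estimates at the chosen lambda
```

Here \(\lambda = 10\) is arbitrary; choosing it well is the subject of a later slide.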
More on Ridge
Like least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small.
The second term, \(\lambda\sum_j\beta_j^2\), is called a shrinkage penalty; it is small when \(\beta_1,...,\beta_p\) are close to 0, and so it has the effect of shrinking the estimates of \(\beta_j\) toward 0.
Shrinkage
Each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of \(\lambda\).
Shrinkage Coefficients
This displays the same ridge coefficient estimates as the previous graphs, but instead of displaying \(\lambda\) on the x-axis, we now display \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\), where \(\hat{\beta}\) denotes the vector of the least squares coefficient estimates.
The notation \(||\beta||_2\) denotes the \(\ell_2\) norm of the vector \(\beta\), defined as \(||\beta||_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\); it measures the distance of \(\beta\) from zero.
Ridge - Scaling Predictors
The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\). In other words, regardless of how the jth predictor is scaled, \(X_j\hat{\beta}_j\) will remain the same.
In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
\[\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}}\]
Ridge Regression
- IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)
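In a tidymodels workflow, one way to standardize is a recipe step; this sketch again assumes the `Hitters` data from ISLR2. (The glmnet engine also standardizes predictors internally by default.)

```r
# A recipe that dummy-codes factors, then centers and scales every
# numeric predictor (i.e., divides by its standard deviation)
hitters <- na.omit(Hitters)

ridge_rec <- recipe(Salary ~ ., data = hitters) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```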
Choosing \(\lambda\)
- \(\lambda\) is known as a tuning parameter and is selected using cross validation
- For example, choose the \(\lambda\) that results in the smallest estimated test error
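A cross-validation sketch of this tuning step, again assuming the `Hitters` data from ISLR2 and the glmnet engine:

```r
set.seed(1)
hitters <- na.omit(Hitters)
folds <- vfold_cv(hitters, v = 10)  # 10-fold cross validation

# tune() marks lambda as a parameter to be chosen by resampling
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet")

ridge_rec <- recipe(Salary ~ ., data = hitters) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

ridge_wf <- workflow() %>%
  add_recipe(ridge_rec) %>%
  add_model(ridge_spec)

# candidate lambda values, spaced on the log10 scale
lambda_grid <- grid_regular(penalty(range = c(-2, 4)), levels = 30)

ridge_res <- tune_grid(ridge_wf, resamples = folds, grid = lambda_grid)

# the lambda with the smallest estimated test error (RMSE)
select_best(ridge_res, metric = "rmse")
```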
Bias-variance tradeoff
How do you think ridge regression fits into the bias-variance tradeoff?
- As \(\lambda\) ☝️, bias ☝️, variance 👇
Ridge Bias-variance tradeoff
Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\) and \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
Lasso
Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta}_\lambda^L\), minimize the quantity
\[\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2+\lambda\sum_{j=1}^p|\beta_j|\]
\[ = RSS + \lambda\sum_{j=1}^p|\beta_j|\]
where \(\lambda\geq 0\) is a tuning parameter, to be determined separately
- In statistics lingo, the lasso uses an \(\ell_1\) (pronounced “ell 1”) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_1 = \sum|\beta_j|\)
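In tidymodels, switching from ridge to lasso is a one-argument change: `mixture = 1` selects the \(\ell_1\) penalty. A sketch, again assuming the `Hitters` data from ISLR2 and the glmnet engine:

```r
hitters <- na.omit(Hitters)

# mixture = 1 => lasso (ell_1) penalty
lasso_spec <- linear_reg(penalty = 10, mixture = 1) %>%
  set_engine("glmnet")

lasso_fit <- lasso_spec %>%
  fit(Salary ~ ., data = hitters)

# unlike ridge, the lasso sets some coefficients exactly to zero,
# so it performs variable selection
tidy(lasso_fit) %>% filter(estimate != 0)
```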